Abstract


We present a randomized algorithm sorting n integers
in $O(n\sqrt{\log\log n})$ expected time and linear space. This
improves the previous $O(n\log\log n)$ bound by Andersson et
al. from STOC’95.

As an immediate consequence, if the integers are
bounded by U, we can sort them in $O(n\sqrt{\log\log U})$
expected time. This is the first improvement over the
$O(n\log\log U)$ bound obtained with van Emde Boas’ data
structure from FOCS’75.

At the heart of our construction is a technical deterministic
lemma of independent interest; namely, that we can split n
integers into subsets of size at most $\sqrt{n}$ in linear time and
space. This also implies improved bounds for deterministic 
string sorting and integer sorting without multiplication. 


1 Introduction 


Integer sorting has always been an important task in con- 
nection with the digital computer. A classic example is 
the folklore algorithm radix sort, which according to Knuth 
[28] is referenced as far back as in 1929 by Comrie in a 
document describing punched-card equipment [14]. 

Whereas radix sort works in linear time for O(log n)-bit 
integers, it was not until 1990 that Fredman and Willard [17] 
beat the $\Omega(n\log n)$ comparison-based sorting lower bound
for the case of arbitrary single word integers. The word-size
W is determined by the processor. We assume $W \ge \log n$
so that we can address the different integers, but in princi- 
ple, W can be arbitrarily large compared with n. An equiv- 
alent formulation of our assumptions is that we only assume 
constant time operations on integers polynomial in the sum 
of the input integers. The assumption that each integer fits
in a machine word implies that integers can be operated on 
with single instructions. A similar assumption is made for 
comparison based sorting in that an O(n log n) time bound 
requires constant time comparisons. However, for integer 
sorting, besides comparisons, we can use all the other in- 
structions available on integers in standard imperative pro- 
gramming languages such as C [26]. It is important that the 
word-size W is finite, for otherwise, we can sort in linear 
time by a clever coding of a huge vector-processor dealing 
with n input integers at a time [30, 27].

Concretely, Fredman and Willard [17] showed that we
can sort deterministically in $O(n\log n/\log\log n)$ time and
linear space and randomized in $O(n\sqrt{\log n})$ expected time
and linear space. The randomized bound can also be
achieved deterministically using space unbounded in terms
of n (of the form $2^{\varepsilon W}$ for constant $\varepsilon > 0$), but here we focus
on bounds with space bounded in terms of n.

In 1995, Andersson et al. [5] improved Fredman and
Willard’s $O(n\sqrt{\log n})$ expected time for integer sorting to
$O(n\log\log n)$ expected time. Both of these bounds use linear
space. A similar result was found independently by Han
and Shen [23]. 

The above mentioned bounds are the best unrestricted 
bounds in the sense that no better bounds are known even if 
besides the randomization, we have unlimited space and are 
free to define our own operations on words. 

The above results have provided great inspiration for 
many researchers, trying to improve them in various ways. 
For example, there has been work on avoiding randomiza- 
tion [4, 21, 22, 31, 33] and there has been work on avoiding 
multiplication [3, 6, 36]. Also there has been lots of work 
on more dynamic versions of the problem such as priority 
queues [13, 18, 31, 33, 34] and searching [3, 4, 8, 9]. 

The $O(n\log\log n)$ algorithm of Andersson et al. from
1995 [5] is very simple, and the fact that it has sustained so
much interest has led many researchers to think that this
could be the complexity of integer sorting, just as $O(n\log n)$
is the complexity of comparison based sorting.

However, in this paper, we improve the $O(n\log\log n)$
expected time to $O(n\sqrt{\log\log n})$ expected time:


Theorem 1 There is a randomized algorithm sorting n integers,
each stored in one word, in $O(n\sqrt{\log\log n})$ expected
time and linear space.


We leave open the problems of getting a corresponding
deterministic algorithm and of avoiding the use of multiplication.
We note that the dynamic aspects are already pretty settled,
for the complexity of dynamic searching is known to be
$O(\sqrt{\log n/\log\log n})$ [8, 9], and priority queues have once
and for all been reduced to sorting [35].

Since integers of O(log n) bits can be sorted in linear 
time using radix sort, we get the following immediate con- 
sequence of Theorem 1: 


Corollary 2 We can sort n integers of size at most U in
$O(n\sqrt{\log\log U})$ expected time.


This is the first improvement over the O(n log log U) bound 
obtained with van Emde Boas’ data structure from 1975 
[37]. Indeed, the O(n log log n) sorting algorithm of An- 
dersson et al. [5] combines van Emde Boas’ data structure 
with the packed merging of Albers and Hagerup [2] so as to 
match the O(n log log U) bound for large values of U. 

We note here that the general O(log log U) bound of 
van Emde Boas [37] has been improved in the context 
of static search structures. More precisely, Beame and 
Fich [9] have shown that one can preprocess a set of 
n integers in polynomial time and space so that given
any x, one can search the largest stored integer below
x in $O(\log\log U/\log\log\log U)$ time. However, due to
the polynomial construction time, this improvement does 
not help with sorting. Beame and Fich [9] combine 
their result with Andersson’s exponential search trees from 
[4], giving a dynamic search structure with an amor- 
tized update time of O(log log U log log n/ log log log U). 
This dynamic search structure could be used for sort- 
ing in O(n log log U log log n/ log log log U) time, but this 
is never better than the best of $O(n\log n)$ time and
$O(n\log\log U)$ time. Thus, from the perspective of sorting,
our $O(n\sqrt{\log\log U})$ bound constitutes the first improvement
over the $O(n\log\log U)$ bound derived from van Emde
Boas’ data structure from 1975 [37].

We will, in fact, prove the following refinement of Corol- 
lary 2: 


Theorem 3 There is a randomized algorithm sorting n
word integers, each of size at most $U < 2^W$, in
$O\left(n\sqrt{\log\frac{\log U}{\log n}}\right)$ expected time and linear space.


Theorem 3 improves a corresponding refinement by Kirkpatrick
and Reisch [27] of van Emde Boas’ bound to
$O\left(n\log\frac{\log U}{\log n}\right)$ expected time.

The following simple lemma states that it suffices for us 
to prove Theorem 3: 


Lemma 4 Theorem 3 implies Theorem 1 and Corollary 2.


Proof: Trivially Theorem 3 implies the weaker Corollary 2.
However, Andersson et al. [5] have shown that we
can sort in linear expected time if $W \ge (\log n)^{2+\varepsilon}$, and otherwise,
$\log\log U \le \log W \le 3\log\log n$. □


1.1 Other domains 


In this paper, we generally assume that our integers to be 
sorted each fit in one word, where a word is the maximal 
unit that we can operate on with a single instruction. In this 
subsection, we briefly discuss implications of this case to 
many other domains. 

First, consider the case of lexicographically ordered 
strings of words. Andersson and Nilsson [7] have presented 
an optimal randomized reduction from this case to that of 
integers fitting in single words. Applying their reduction to 
Theorem 1, we get 


Corollary 5 We can sort n variable-length strings distributed
over N words in $O(n\sqrt{\log\log n} + N)$ expected
time. In fact, we can get down to $O(n\sqrt{\log\log n} + L)$ expected
time where $L = \sum_i \ell_i$ and $\ell_i$ is the length of the
distinguishing prefix of string i, that is, the smallest prefix
of string i distinguishing it from all the other strings, or the
length of string i if it is a duplicate.


We note here that any algorithm sorting the strings will have 
to read the distinguishing prefixes, and since it takes an in- 
struction to read each word, it follows that an additive O(L) 
is necessary. 

One may instead be interested in variable length 
multiple-word integers where integers with more words are 
considered bigger. However, by prefixing each integer with 
its length, we reduce this case to lexicographic string sort- 
ing. 
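
The length-prefix reduction above can be sketched as follows. This is a minimal illustration, assuming each multiple-word integer is given as a list of words with the most significant word first and no leading zero words; the helper name `length_prefixed` is hypothetical.

```python
def length_prefixed(words):
    """Return the word list prefixed with its own length.

    Comparing these prefixed lists lexicographically orders integers with
    more words after integers with fewer words, and integers of equal
    length by their words.
    """
    return [len(words)] + list(words)


# Multiple-word integers, most significant word first (illustrative data).
numbers = [[5], [1, 0], [9, 9], [2, 0, 0], [7]]
ordered = sorted(numbers, key=length_prefixed)
```

Python compares lists lexicographically, so the built-in sort stands in here for a lexicographic string sorter.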

In our presentation, we think of all our integers as un- 
signed non-negative integers. However, the standard repre- 
sentation of signed integers is such that if we flip the sign- 
bit, we can sort them as unsigned integers, and then flip 
the sign-bit back again. Floating point numbers are even
easier, for the IEEE 754 floating-point standard [25] is de- 
signed so that the ordering of floating point numbers can 
be deduced by perceiving their representations as multiple 
word integers. Also, if we are working with fractions where 
both numerator and denominator are single word integers,
we get the right ordering if for each fraction, we make the
division in floating point numbers with double precision.
Now we get the correct ordering of the original integer frac- 
tions by perceiving the corresponding floating point num- 
bers as integers. 
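
The two key transformations can be sketched as below. The sign-bit flip follows the text directly. For floating point numbers, the handling of negative values (inverting all bits) is a commonly used convention assumed here, not something spelled out in the text; NaNs and negative zero are ignored, and the function names are hypothetical.

```python
import struct


def signed_key(x, bits=32):
    """Map a two's-complement signed integer to an unsigned key by
    flipping its sign bit, so unsigned order equals signed order."""
    return (x & ((1 << bits) - 1)) ^ (1 << (bits - 1))


def float_key(f):
    """Map an IEEE 754 double to an unsigned 64-bit key with the same order.

    Assumption: negative values have all bits inverted, non-negative
    values only get the sign bit set.
    """
    (u,) = struct.unpack("<Q", struct.pack("<d", f))
    if u >> 63:                      # sign bit set: negative number
        return u ^ 0xFFFFFFFFFFFFFFFF
    return u | (1 << 63)


ints = [-5, 3, 0, -1, 2**31 - 1, -2**31]
floats = [-2.5, 0.0, 1.5, -0.1, 3e10, -1e-300]
```

Sorting by these keys with any unsigned integer sorter then gives the signed or floating point order.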


1.2 Machine model


Recall that our machine model is a normal computer with 
an instruction set corresponding to what we program in a 
standard programming language such as C [26] or C++ [32]. 
We have a processor determined word-size W, limiting how 
big integers we can operate on in constant time. We assume 
that each input integer fits in a single word. We note that 
for generic code, the type of a full word integer, e.g. long
long int, should be a macro parameter in C or template
parameter in C++. We adopt the unit-cost time measure 
where each operation takes constant time. 

Interestingly, the traditional theoretical RAM model of 
Cook and Reckhow [15] allows infinite words. A disturb- 
ing consequence of infinite words is that with normal op- 
erations such as shifts or multiplication, we can simulate 
an exponentially big parallel processor solving all problems 
in NP in polynomial time. Hence such operations have to 
be banned from the above unit-cost theory RAM, making it 
even more contrived from a practical view-point. 

However, by adopting the real-world limitation of a lim- 
ited word-size, we both resolve the above theoretical issue, 
and we get algorithms that can be implemented in the real 
world. Hagerup [19] has named this model the word RAM. 
The word RAM has a fairly long tradition within integer 
sorting, being advocated and used in the 1984 paper [27] by 
Kirkpatrick and Reisch, and in the seminal 1993 paper of 
Fredman and Willard [17]. 

We note that our unit-cost multiplication may be con- 
sidered somewhat questionable in that multiplication is not 
in $AC^0$ [10], that is, there is no multiplication circuit of
constant-depth and of size polynomial in W. We leave 
it as an open problem to improve Thorup’s randomized 
O(n log log n) expected time and linear space sorting with- 
out multiplication [36]. 

We will now discuss some of the (delightfully) dirty 
tricks that we can use on the word RAM, and which are 
not allowed in the comparison based model or on a pointer 
machine. 

A first nice feature of the word RAM over the compar- 
ison based model is that we can add and subtract integers. 
For example, we can use this to code multiple comparisons 
of short integers packed in single words. The idea of mul- 
tiple comparisons was first introduced by Paul and Simon 
[30] in 1980. It should be noted that this use of uniproces- 
sors as vector processors is a standard trick in practice, not 
in connection with sorting, but in connection with graphics, 
where a single word operation is used to manipulate the information
on several pixels, each represented by one byte
of the word. 

The word RAM model distinguishes itself both from the 
comparison based model and from the pointer machine in 
that we can use integers, and segments of integers, as ad- 
dresses. This trick goes back to radix sort where an integer 
is viewed as a vector of characters, and these characters are 
used as addresses. Another word RAM trick in this direc- 
tion is that we can hash integers into smaller ranges. Here 
radix sort goes back at least to 1929 [14] and hashing goes 
back at least to 1956 [16], both being developed for efficient 
problem solving in the real world. 
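
The use of characters as addresses can be sketched with a least-significant-digit radix sort. This is a textbook variant given for illustration under the assumption of non-negative integers; the digit width is an arbitrary choice.

```python
def radix_sort(values, bits_per_digit=8):
    """Sort non-negative integers by repeatedly bucketing on one digit.

    Each digit (character) of an integer is used directly as the address
    of a bucket; concatenating buckets in order keeps the pass stable.
    """
    out = list(values)
    if not out:
        return out
    mask = (1 << bits_per_digit) - 1
    max_value = max(out)
    shift = 0
    while (max_value >> shift) > 0:
        buckets = [[] for _ in range(1 << bits_per_digit)]
        for v in out:
            buckets[(v >> shift) & mask].append(v)    # digit as address
        out = [v for bucket in buckets for v in bucket]
        shift += bits_per_digit
    return out


data = [170, 45, 75, 90, 802, 24, 2, 66, 70000]
```

With digits of about log n bits, integers of O(log n) bits need only a constant number of passes.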

Fredman and Willard [17] further use the RAM for ad- 
vanced tabulation of complicated functions over small do- 
mains. Their tabulation is too complicated to be of practical 
relevance, but tabulation of functions is in itself commonly 
used to tune code. As a simple example, Bentley [11, pp. 
83-84] suggests that an efficient method for computing the 
number of set bits in 32-bit integers is to have a preprocess- 
ing where we first tabulate the number of set bits in all the 
256 different 8-bit integers. Now, given a 32-bit integer x, 
we view it as the concatenation of four 8-bit integers, and 
for each of these, we look up the number of set bits in our 
table. Finally we just add up these four numbers to get the 
number of set bits in x. 
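
The table lookup described above can be sketched as follows; the names `BYTE_POPCOUNT` and `popcount32` are illustrative and not taken from Bentley's text.

```python
# Preprocessing: number of set bits in each of the 256 possible 8-bit values.
BYTE_POPCOUNT = [bin(b).count("1") for b in range(256)]


def popcount32(x):
    """Count set bits of a 32-bit integer with four table lookups."""
    x &= 0xFFFFFFFF
    return (BYTE_POPCOUNT[x & 0xFF]
            + BYTE_POPCOUNT[(x >> 8) & 0xFF]
            + BYTE_POPCOUNT[(x >> 16) & 0xFF]
            + BYTE_POPCOUNT[(x >> 24) & 0xFF])
```

For example, `popcount32(0b1011)` looks up the bytes 11, 0, 0 and 0 and adds the four table entries.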

Summing up, we have argued that the “dirty tricks” facil- 
itated by the word RAM are well established in the practice 
of writing fast code. Hence, if we disallow these tricks, we 
are not discussing the time complexity of running imperative
programs on real world computers. On the same note,
it should be admitted that this is a theory paper. The al- 
gorithms presented are too complicated and have too large 
constants hidden in the O-notation to be of any immediate 
practical use. This does not preclude that some of the ideas 
may find use in practice. For example, Nilsson [29] has 
demonstrated that the O(n log log n) algorithm of Anders- 
son et al. [5] can be implemented so as to be competitive 
with the best practical sorting algorithms. 


1.3 Deterministic splitting and sorting 


The heart of our construction is a deterministic splitting
result of independent interest.


Definition 6 A splitting of an ordered set X is a partitioning
into sets $X_0 < X_1 < \cdots < X_k$. Here, $A < B$ denotes
that $a < b$ for all $(a, b) \in A \times B$.


We are generally thinking of sets as multisets. If all elements
of a set are identical, we call it a duplicate set; otherwise,
we call it a diverse set.


Theorem 7 We can split a set of n word integers in linear
time and space so that each diverse subset has at most $\sqrt{n}$
integers.


Applying Theorem 7 recursively, we immediately get that 
we can sort deterministically in O(n log log n) time and lin- 
ear space, but this has already been proved by Han in [22]. 
However, for deterministic string sorting, we get the follow- 
ing new result which does not follow from [22] (the general 
reduction of Andersson and Nilsson [7] from word sorting 
to string sorting is randomized): 


Corollary 8 We can sort n variable-length strings distributed
over N words in $O(n\log\log n + N)$ time and linear
space. In fact, we can get down to $O(n\log\log n + L)$
time where L is the sum of the lengths of the distinguishing
prefixes.


Proof: To get the corollary, we simply apply Theorem 
7 recursively but only to the first unmatched word of each 
string. That is, our recursive input is a subset of the integers 
with some common matched prefix. In the root call, the sub- 
set is the complete set with nothing matched so far. Integers 
ending up in a duplicate set match in one more word, and 
the other integers end up in sets of size reduced to the square 
root. Since the splitting takes constant time per integer, we 
pay constant time per word matched and O(log log n) time 
for reductions into smaller sets. □


We also present variants of the splitting using only standard
$AC^0$ operations, that is, $AC^0$ operations available via a
standard programming language such as C or C++.


Theorem 9 For any positive $\varepsilon$, using standard $AC^0$ operations
only, we can split a set of n $(W/(\log\log n)^{1+\varepsilon})$-bit
integers in linear time so that diverse subsets have size at
most $\sqrt{n}$.


Corollary 10 For any positive $\varepsilon$, using standard $AC^0$ operations
only, we can sort n words in $O(n(\log\log n)^{1+\varepsilon})$
time and linear space.


Proof: We use the same proof as for Corollary 8, but
viewing each word as a string of $(W/(\log\log n)^{1+\varepsilon})$-bit
characters, to which we can apply Theorem 9. □

Corollary 10 improves the previous
$O(n(\log\log n)^2/\log\log\log n)$ bound of Han from
[20], and it gets very close to the best multiplication based
deterministic bound of $O(n\log\log n)$ [22].


1.4 Contents


The rest of the paper is divided into two sections. Section
2 presents our randomized sorting assuming deterministic
splitting, and Section 3 presents our deterministic splitting.


2 Fast randomized sorting


For our fast randomized sorting, we need a slight variant 
of the signature sorting of Andersson et al. [5] (cf. Appendix 
A.1): 


Lemma 11 With an expected linear-time additive overhead,
signature sorting with parameter r reduces the problem
of sorting n integers of $\ell$ bits to


(i) the problem of sorting n reduced integers of $4r\log n$
bits each and


(ii) the problem of sorting n integer fields of $\ell/r$ bits each.


Here (i) has to be solved before (ii).


We are now going to present our randomized algorithm for
sorting n word integers in linear space and $O(n\sqrt{\log p})$ expected
time where $p = \log U/\log n$.


Repeated splitting First we apply our splitting from Theorem 7
to recursively split the diverse sets of integers
$\lceil\sqrt{\log p}\rceil$ times, thus getting a splitting with diverse subsets
of size at most $n' = n^{1/2^{\sqrt{\log p}}}$. Each integer is involved in
at most $\lceil\sqrt{\log p}\rceil$ linear time splittings, so the total time is
$O(n\sqrt{\log p})$.


Repeated signature sort To each of the diverse sets $S_i$ of
size at most $n'$, we apply the signature sort from Lemma 11
with $r = 2^{\sqrt{\log p}}$. Then the reduced integers from (i) have
$4r\log n' = O(\log n)$ bits.

We are going to sort these reduced integers from all the
subsets $S_i$ together, but prefixing reduced integers from $S_i$
with i. The prefixed reduced integers still have $O(\log n)$
bits, so we can radix sort them in linear time. From the
resulting sorted list we can trivially extract the sorted sublist
of reduced integers for each $S_i$, thus completing task (i)
from Lemma 11.

We have now spent linear expected time on reducing
the problem to dealing with the fields from Lemma 11 (ii)
and the field length is only a fraction $1/2^{\sqrt{\log p}}$ of the original
length.

We repeat this signature sorting $\lceil\sqrt{\log p}\rceil$ times, at a total
expected cost of $O(n\sqrt{\log p})$. We started with integers of
length $\log U$, and each round reduces the integer length by a
factor $2^{\sqrt{\log p}}$, so we end up with integers of length at most

$\log U/(2^{\sqrt{\log p}})^{\sqrt{\log p}} \le \log U/2^{\log p} = \log n.$


Since the lengths are now at most logn, we can trivially 
finish with a linear time bucket sort. 
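
As a sanity check on the parameters of the two phases, here is a worked instance with illustrative values ($\log n = 16$, $\log U = 256$) that are not from the paper. It takes $r = 2^{\sqrt{\log p}}$ and $n' = n^{1/2^{\sqrt{\log p}}}$, and the ceilings play no role since $\sqrt{\log p}$ is an integer here.

```latex
\begin{align*}
p &= \frac{\log U}{\log n} = \frac{256}{16} = 16, \qquad
  \sqrt{\log p} = \sqrt{4} = 2, \qquad
  r = 2^{\sqrt{\log p}} = 4,\\
\log n' &= \frac{\log n}{2^{\sqrt{\log p}}} = \frac{16}{4} = 4, \qquad
  4r\log n' = 4\cdot 4\cdot 4 = 64 = 4\log n,\\
\frac{\log U}{r^{\sqrt{\log p}}} &= \frac{256}{4^{2}} = 16 = \log n .
\end{align*}
```

So after two splitting rounds and two signature sorting rounds, the remaining fields have $\log n$ bits and can be bucket sorted.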

Summing up The total expected time spent in the above
algorithm is $O(n\sqrt{\log p})$, and we have only used linear
space, as desired.

Thus, our only remaining task is to provide the splitting 
from Theorem 7, that is, 


Lemma 12 Theorem 7 implies Theorem 3. □


3 Deterministic splitting 


To prove Theorem 7, first in Section 3.1, we reformulate 
it in terms of splitting over a given set of splitters. 


3.1 Splitting over splitters 


We will actually prove our splitting result in terms of
an equivalent formulation. A splitting of a set X into sets
$X_0 < X_1 < \cdots < X_k$ is a splitting over k splitters
$y_1 < y_2 < \cdots < y_k$ if $X_0 < \{y_1\} < X_1 < \{y_2\} < \cdots <
\{y_k\} < X_k$.


Lemma 13 The following statements are equivalent:


(a) We can split a set of n integers so that diverse subsets
are of size at most $n^{1-\varepsilon_a}$ in linear time and space for
some positive constant $\varepsilon_a$.


(b) We can split a set of n integers so that diverse sets are
of size at most $n^{1-\varepsilon_b}$ in linear time and space for any
positive constant $\varepsilon_b$.


(c) We can split a set of n words over $n^{\varepsilon_c}$ splitters in linear
time and space for any positive constant $\varepsilon_c$.


(d) We can split a set of n words over $n^{\varepsilon_d}$ splitters in linear
time and space for some positive constant $\varepsilon_d$.


Proof: (a)⇒(b) To get diverse subsets of size $n^{1-\varepsilon_b}$ we
apply (a) recursively on diverse sets. If (a) is applied i times
to subsets containing x, the set containing x ends up non-diverse,
or of size at most $n^{(1-\varepsilon_a)^i}$. Thus x gets involved at
most $\log_{1-\varepsilon_a}(1-\varepsilon_b) = O(1)$ times.


(b)⇒(c) To prove (c) with any given value of $\varepsilon_c$, we apply
(b) with $\varepsilon_b = 2\varepsilon_c$. Now, the diverse subsets have at most
$n^{1-2\varepsilon_c}$ integers. Out of these subsets, at most $n^{\varepsilon_c}$ contain a
splitter. We split each subset with splitters using traditional
comparison based sorting. The total number of integers involved
in this is at most $n^{1-2\varepsilon_c}n^{\varepsilon_c} = n^{1-\varepsilon_c}$, so the total
time for sorting is $O(n^{1-\varepsilon_c}\log n) = O(n)$. Having sorted
all diverse subsets with splitters over these splitters, we immediately
derive the desired splitting of the original set.
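
The step from a splitting into small subsets to a splitting over given splitters can be sketched as follows. This is a simplified illustration, not the paper's implementation: blocks are plain Python lists, group j collects the integers x with $y_j \le x < y_{j+1}$ (a convention assumed here), and the function name is hypothetical.

```python
from bisect import bisect_right


def split_over_splitters(blocks, splitters):
    """Distribute the integers of `blocks` into groups delimited by the
    sorted list `splitters`.

    A block whose minimum and maximum fall between the same two splitters
    is moved as a whole; only blocks cut by a splitter are sorted with a
    comparison based sort, as in the argument above.
    """
    groups = [[] for _ in range(len(splitters) + 1)]
    for block in blocks:
        if not block:
            continue
        lo = bisect_right(splitters, min(block))
        hi = bisect_right(splitters, max(block))
        if lo == hi:
            groups[lo].extend(block)          # no splitter inside the block
        else:
            for x in sorted(block):           # block is cut by a splitter
                groups[bisect_right(splitters, x)].append(x)
    return groups


blocks = [[2, 1, 3], [7, 5, 6], [9, 8]]
groups = split_over_splitters(blocks, [4, 6])
```

In this example only the middle block is cut by a splitter, so only its three integers are sorted.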


(c)⇒(d) is trivial.


(d)⇒(a) If n = O(1), (a) is trivial, so assume $n = \omega(1)$.
We divide our input integers into batches of size $\sqrt{n}$. Using
(d), we can split such a batch over $n^{\varepsilon_d/2}$ splitters in
linear time.

We will develop an appropriate set Y of splitters as
we go along, starting with $Y = \emptyset$. Each time we come with
a batch of integers, we split them according to the current
splitters $y_1 < y_2 < \cdots < y_k$, adding them to the splitting
$X_0 < \{y_1\} < X_1 < \{y_2\} < \cdots < \{y_k\} < X_k$ done so
far. If one of the diverse subsets $X_i$ gets $4n^{1-\varepsilon_d/2}$ or more
elements, we split it according to its median z, and make z
and z + 1 two new splitters. Obviously, we end with at most
$4n^{1-\varepsilon_d/2}$ elements in each diverse set, and since $n = \omega(1)$,
$4n^{1-\varepsilon_d/2} \le n^{1-\varepsilon_a}$ for some positive constant $\varepsilon_a$.

We will charge each splitting over a median to $2n^{1-\varepsilon_d/2}$
elements. This implies that the median finding and subsequent
splitting is done in linear time, and that the total
number of splitters is at most $2n/(2n^{1-\varepsilon_d/2}) = n^{\varepsilon_d/2}$, as
needed for applying (d) with a batch of $\sqrt{n}$ integers.

The charging is simple: every time a diverse subset is
started, it has at most $2n^{1-\varepsilon_d/2}$ charged elements. We only
split the set if it gets more than $4n^{1-\varepsilon_d/2}$ elements, which
means that we have at least $2n^{1-\varepsilon_d/2}$ non-charged elements that
we can charge, and we can easily distribute the charging so
that each of the resulting diverse sets gets at most $2n^{1-\varepsilon_d/2}$
charged elements. Hence, any diverse subset again starts with
at most $2n^{1-\varepsilon_d/2}$ charged elements. □


The rest of this paper is devoted to prove Lemma 13 (d) 
with €g = ¢/n, which by Lemma 13 (b) implies Theorem 
7. That is, our remaining task is to split n integers over ¢/n 
splitters in linear time and space. 


3.2 Deterministic signature sorting 


We shall use a deterministic version of signature sort, 
essentially done by Han [21] (cf. Appendix A.2). 


Lemma 14 With an O(n+<4s*) time additive overhead, de- 
terministic signature sorting with parameter r reduces the 
problem of splitting n $\ell$-bit integers over s splitters to


(i) the problem of splitting n reduced integers of $4r\log n$
bits over s reduced splitters, and


(ii) the problem of splitting n fields of $\ell/r$ bits over s field
splitters.


Here (i) has to be solved before (ii). 


Han [21] actually paid an additive O(¢s”). The fact that we 
only pay O(£s?) is useful if @ is arbitrarily large compared 
with n as in Lemma 15 below. 


Lemma 15 [f W = (logn)°, c > 5, we can split n word 
integers over at most </n splitters in linear time. 


Proof: We apply the signature sorting of Lemma 14 with
$r = (\log n)^{c-3}$. Then the additive overhead is


O (n + Toes We) = O(n). 


Now, the reduced integers from (i) have length
$4r\log n = 4(\log n)^{c-2} = O(W/(\log n)^2)$


and that implies that we can sort and hence split them
in linear time using the packed sorting of Albers and
Hagerup [2]. Similarly, the fields have length $(\log n)^3 =
O(W/(\log n)^2)$, so they can also be sorted in linear time. □


Above, we could let c go down from 5 to 3 if, instead of 
using the packed sorting of Albers and Hagerup, we used 
the one by Han and Shen [24] which takes linear time for 
O(W/ log n)-bit integers. However, Han and Shen’s result 
employs the sorting network of Ajtai et al. [1], and the im- 
provement does not affect the overall results of this paper. 


Lemma 16 /f W < (logn)®, with a linear time additive 
overhead, we can reduce the problem of splitting n word 
integers over s < %/n splitters into at most four problems 
of splitting q(log n)-bit integers over s splitters where q = 
O(log n) and W = Q(q4 (log n)). 


Proof: Let $p = W/\log n \le (\log n)^4$. First we apply
the deterministic signature sort of Lemma 14 with $r = \sqrt{p}$.
Now both subproblems have integer length $O(\sqrt{p}\,\log n)$. We
then apply Lemma 14 with $r = \sqrt[4]{p}$ to each subproblem,
getting four subproblems, each with integer length
$O(\sqrt[4]{p}\,\log n)$. □

By the two preceding lemmas, we can assume that the
integers are of length $q\log n$ where $q = O(\log n)$, $W =
\Omega(q^4\log n)$, and $W \le (\log n)^5$.
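
The length bookkeeping behind the two rounds of signature sorting can be written out as follows. This is a worked check that reads the subproblem lengths off Lemma 14, namely $4r\log n$ bits for reduced integers and a $1/r$ fraction of the current length for fields, with $p = W/\log n$.

```latex
\begin{align*}
\text{first round } (r=\sqrt{p}):\quad
  & 4r\log n = 4\sqrt{p}\,\log n, \qquad
    \frac{W}{r} = \frac{p\log n}{\sqrt{p}} = \sqrt{p}\,\log n,\\
\text{second round } (r=\sqrt[4]{p}):\quad
  & 4r\log n = 4\sqrt[4]{p}\,\log n, \qquad
    \frac{4\sqrt{p}\,\log n}{\sqrt[4]{p}} = 4\sqrt[4]{p}\,\log n,\\
q = O(\sqrt[4]{p}):\quad
  & q^{4}\log n = O(p\log n) = O(W).
\end{align*}
```

In particular, $p \le (\log n)^4$ gives $q = O(\log n)$.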


3.3 String sorting 


We are going to show: 


Lemma 17 Consider n > W® integers packed with at most 
$k\log k$ integers in each word. We can sort the integers according
to the value of a given segment of at most $(\log n)/2$
consecutive bits so that the time spent on an integer with
segment value c is


$O\left(\frac{1 + \log n - \log n_c}{\log n} + \frac{1}{k}\right)$


where $n_c$ is the number of integers with segment value c.


The above lemma may look somewhat strange, but as 
demonstrated below, it actually implies our main result. 


Lemma 18 Lemma 17 implies Lemma 13 (d) with eg = 


on. 


Proof: We need to split n integers over ¢/n splitters.
From Lemmas 15 and 16, we know that it suffices to con- 
sider q(log n)-bit integers where gq = O(logn), W = 
Q(q‘(logn)), and W < (logn)°. For n = w(1), the latter 
implies W® < n, as required for Lemma 17. 

We view each integer as consisting of 4q characters, each 
of (logn)/4 bits. Also, we have plenty of room to pack 
q(log q) integers in each word. 

The algorithm works recursively, taking a subset of $n' \ge
\sqrt{n}$ integers, all with a common prefix of length i. We then
use Lemma 17 with k = q to sort the integers according to
character i + 1. We note that this character has $(\log n)/4 \le
(\log n')/2$ bits, as required of the segment in Lemma 17.
Also, we note that this is, in fact, a splitting since all the
integers agree on the preceding characters. Starting with all
n integers, we recurse on diverse subsets until they all have
size at most $\sqrt{n}$.

To see that this implies a linear time splitting, let
$n_0, n_1, \ldots, n_t$ be the sizes of the sets that a given integer
x is involved in. That is, we start with $n_0 = n$. For
round i, we have $n_{i-1} \ge \sqrt{n}$ integers, and we match
$(\log n)/4 \le (\log n_{i-1})/2$ bits of x, finding agreement with
$n_i$ other integers. Hence, the cost for x is


$O\left(\frac{1 + \log n_{i-1} - \log n_i}{\log\sqrt{n}} + \frac{1}{q}\right).$


Summing this for $i = 1, \ldots, t$, we get a total cost for x of


$O\left(\frac{\log n_0 - \log n_t}{\log\sqrt{n}} + t/\log\sqrt{n} + t/q\right).$

Here $n_0 = n$ so the first term is constant. Moreover, by
definition, there are $4q = O(\log n)$ characters, and using
this upper-bound on t, we see that the last two terms are
constant. □
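
For completeness, the telescoping sum can be spelled out as follows. This is a worked version of the summation, using $n_0 = n$, $n_t \ge 1$, $t \le 4q$ and $q = O(\log n)$.

```latex
\begin{align*}
\sum_{i=1}^{t}\left(\frac{1+\log n_{i-1}-\log n_i}{\log\sqrt{n}}+\frac{1}{q}\right)
  &= \frac{\log n_0-\log n_t}{\log\sqrt{n}}+\frac{t}{\log\sqrt{n}}+\frac{t}{q}\\
  &\le \frac{\log n}{\tfrac12\log n}+\frac{4q}{\tfrac12\log n}+\frac{4q}{q}
   = 2+\frac{8q}{\log n}+4 = O(1).
\end{align*}
```

Each integer thus pays constant total time, which gives the linear bound for the whole splitting.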


3.4 Sorting over a segment 


The goal of this section is to prove Lemma 17. We 
will use the lemma below on packed bucketing, essentially 
proved by Han in [21] (cf. Appendix A.3). 


Lemma 19 Consider n integers packed with $k\log k$ integers
in each word, and that an $\ell$-bit label for each integer
is packed in a parallel set of words. We can then sort the
integers according to their labels in $O(\ell/\log n + 1/k)$ time
per integer.


Lemma 20 Consider n > W® integers packed with at least 
k(log k) integers in each word. We can group all integers 
with respect to matches with t < </n target integers within 
a given segment in 


O (log(t + 1)/logn + 1/k) 


time per integer. Integers not matching any target integer
end in one group. 


Proof: Our goal is to construct a parallel set of labels so 
that we can apply Lemma 19. First we assume that t > 2. 

In $O(t^2 W)$ time, we construct a perfect hash function
from the segments of the targets into $\lceil 2\log t\rceil$ bits. Since
k = O(W), the time spent is O(n/k).

To apply the hash function to all our integers, we first 
make copies of the words containing them into a parallel 
set of words, then we mask out the segments, shifting them 
to the least significant part of each integer. Finally, we apply
the hash function so that we now have a parallel set of
$\lceil 2\log t\rceil$-bit labels, each aligned with the least significant
part of its original integer. We generally refer the reader to
[5, pp. 77-79] for details on making such word operations 
on multiple integers in a word, including the hashing. We 
only spend constant time per word, hence O(1/(k log k)) 
time per integer. 

We now apply Lemma 19, getting the integers sorted according
to their labels in $O(\log t/\log n + 1/k)$ time per integer.
We have $O(t^2)$ labels, and exactly one label for each
target. Fix $a_0$ to be a label which is not a label of a target.
For each target y, we pack $k\log k$ copies of y in a word
$y^*$. This takes $O(tk\log k) = O(n/k)$ total time.

The integers in the words are sorted according to their la- 
bels, and comparing this with the sorted list of target labels, 
we can easily identify words of integers where all or some 
of the parallel labels match some target label. 

For each word all of whose labels match the label of a target
y, we compare the parallel word of integers with $y^*$. For
each non-match, the corresponding label is replaced with
$a_0$. All this takes constant time per word. For words having
no target labels, all labels are replaced by $a_0$. There can be
at most 2t words with only some labels being target labels.
These have at most $2tk\log k = O(n/k)$ parallel integers,
and for each such integer, we can trivially, in constant time,
replace its label with $a_0$ if it doesn’t match a target.

We now re-apply Lemma 19, getting the integers sorted 
according to their revised labels. However, this time the 
sorting gives the desired grouping. The segment of integers 
with label $a_0$ are exactly those that do not match any target.

Finally, we have two special cases of t = 0, 1. The
case t = 0 is trivial in that all integers belong in the non-matching
group, and hence nothing has to be done, agreeing
with log(t + 1) = 0 in the time bound. For t = 1, we
copy the unique target so that all integers in a word can be
matched in constant time. We use a 1-bit label with 1 for
match and 0 for non-match, and finally apply Lemma 19.
Since log(t + 1) = 1, this again gives us the desired time
bound. □


The lemma below essentially shows that if we had guessed 
the frequencies, we would be done. 


Lemma 21 Consider n > W® integers packed with
k(log k) integers in each word. We are focusing on a specific
segment of the integers. Suppose that for different possible
segment values c, we are given a suggested frequency
$f_c \ge 1/\sqrt[3]{n}$ for integers with that segment value. We further
require $\sum_c f_c \le 1$. We can then group the integers spending


$O((1 - \log f_c)/\log n + 1/k)$


time per integer with segment value c.


Proof: The result is achieved by a simple iterative
algorithm. First we sort the frequencies in descending order.
We start with t = ⌈n^{1/k}⌉ ≥ 2 and repeat squaring t until it
reaches or passes n^{1/3}. For a given value of t, we take the
t remaining segment values of highest frequency, and apply
Lemma 20 to the remaining integers. This groups the
integers matching the t segment values, leaving the remaining
integers for the remaining rounds.

The time spent per integer in a round is O(log(t +
1)/log n + 1/k) = O(log t/log n). Since log t/log n
doubles in each round, it is the last round that an integer x
participates in that dominates the time spent on that x.
However, if the integer has segment value c with frequency f_c,
there can be no more than 1/f_c earlier frequencies, so the
value of t when x is picked is at most 1/f_c^2, or ⌈n^{1/k}⌉ if
1/f_c^2 < n^{1/k}. Consequently, the total time spent on x is


\[
O\!\left(\frac{\log \max\{1/f_c^2,\ \lceil n^{1/k} \rceil\}}{\log n} + \frac{1}{k}\right)
= O\!\left(\frac{1 - \log f_c}{\log n} + \frac{1}{k}\right).
\]
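The round structure of this proof can be sketched as a small schedule computation. The Python below is a hedged illustration only: the function name `rounds_for_values`, the start value ceil(n^(1/k)), and the stopping rule (stop once every suggested value has been assigned) are assumptions of the sketch, and the actual grouping via Lemma 20 is not modeled. It just records in which round each segment value would be picked.

```python
import math


def rounds_for_values(freqs, n, k):
    """Return {segment value: round index} for the squaring schedule.

    freqs maps each segment value c to its suggested frequency f_c.
    The start value and stopping rule are assumptions of this sketch.
    """
    ranked = sorted(freqs, key=freqs.get, reverse=True)  # descending frequency
    t = max(2, math.ceil(n ** (1.0 / k)))
    round_of, taken, rnd = {}, 0, 0
    while taken < len(ranked):
        # Take the t remaining segment values of highest frequency.
        for c in ranked[taken:taken + t]:
            round_of[c] = rnd
        taken += t
        t *= t  # squaring t doubles log t
        rnd += 1
    return round_of
```

Because t is squared after each round, a value of rank j in the sorted order is picked after roughly log log j rounds beyond the first, which is why only the last round an integer participates in matters for its cost.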


Finally, we have 


Proof of Lemma 17 We want to prove 


Consider n ≥ W^8 integers packed with at most
k (log k) integers in each word. We can sort the
integers according to the value of a given segment
of at most (log n)/2 consecutive bits so that the
time spent on an integer with segment value c is


\[
O\!\left(\frac{1 + \log n - \log n_c}{\log n} + \frac{1}{k}\right) \tag{1}
\]


where n_c is the number of integers with segment
value c.


Since there are only n^{1/2} = O(n/log n) possible segment
values, we can initialize arrays indexed by these values.
Thus, with each possible segment value c, we can store a
counter ñ_c for the number of integers found so far with
segment value c. Initially ñ_c = 0. We also have a list of frequencies
f_c = ñ_c/n for segment values c that are common in the
sense that ñ_c ≥ n^{3/4}, hence with f_c ≥ 1/n^{1/4}. Initially,
this list is empty.

We divide the integers into batches of n^{3/4} integers.
For each batch, we first group the integers in the batch with
respect to the common segment values using their current
frequencies in Lemma 21. We note here that
f_c = ñ_c/n ≥ 1/n^{1/4} = 1/(n^{3/4})^{1/3}, so the conditions of
Lemma 21 are satisfied. The cost for an integer with a common
segment value c is


\[
O\!\left(\frac{1 - \log f_c}{\log(n^{3/4})} + \frac{1}{k}\right)
= O\!\left(\frac{1 + \log(n/\tilde{n}_c)}{\log n} + \frac{1}{k}\right) \tag{2}
\]




All remaining integers in the batch are bucketed using stan- 
dard bucketing in constant time per integer. 

We will now argue that the time spent on inserting the
ith integer in one of our buckets is


\[
O\!\left(\frac{1 + \log n - \log i}{\log n} + \frac{1}{k}\right). \tag{3}
\]


If i < 2n^{3/4}, (3) is a constant, covering the cost of standard
bucketing. Otherwise, since each batch adds at most n^{3/4}
integers, the batch was bucketed with ñ_c ≥ i − n^{3/4} ≥ i/2.
By (2), the cost is as in (3) but with i/2 instead of i, but this
does not affect the asymptotic value. Thus, the cost of the
ith integer is always bounded by (3).

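Now, the cost of adding all n_c integers to the bucket for
segment value c is


\[
\begin{aligned}
& O\!\left(\sum_{i=1}^{n_c}\left(\frac{1 + \log n - \log i}{\log n} + \frac{1}{k}\right)\right) \\
&= O\!\left(\frac{n_c(1 + \log n) - (n_c \log n_c - n_c \log e)}{\log n} + \frac{n_c}{k}\right) \\
&= O\!\left(n_c\left(\frac{1 + \log n - \log n_c}{\log n} + \frac{1}{k}\right)\right),
\end{aligned}
\]


which divided by n_c gives the desired time per integer from
(1). □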
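The middle step of this calculation uses the standard estimate log(m!) ≥ m log m − m log e. As a sanity check, the following Python sketch compares the exact sum with the closed form numerically. The function names are ours, logarithms are taken base 2, and the 1/k term is left out since it contributes the same amount to both sides.

```python
import math


def exact_cost(n, n_c):
    """Sum of (1 + log n - log i) / log n over i = 1..n_c (logs base 2)."""
    log_n = math.log2(n)
    return sum((1 + log_n - math.log2(i)) / log_n for i in range(1, n_c + 1))


def closed_form_bound(n, n_c):
    """Upper bound obtained from log(n_c!) >= n_c log n_c - n_c log e."""
    log_n = math.log2(n)
    lower_on_log_factorial = n_c * math.log2(n_c) - n_c * math.log2(math.e)
    return (n_c * (1 + log_n) - lower_on_log_factorial) / log_n
```

For every n_c ≥ 1 the exact sum stays below the closed form, and dividing the closed form by n_c leaves (1 + log e + log n − log n_c)/log n, which matches (1) up to the constant log e.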


4 Summing up 


Proof of Theorem 1, Corollary 2, and Theorem 3 The
results follow directly from the statements of Lemmas 4, 12,
13, 17, and 18. □


Proof Sketch for Theorem 9 and Corollary 10 We want
to show


For any positive ε, using standard AC^0 operations
only, we can split a set of n (W/(log log n)^{1+ε})-bit
integers in linear time so that diverse subsets
have size at most √n.
We can essentially reuse the splitting algorithm for
Theorem 7. The only non-AC^0 operation used is multiplication,
used for hashing. Brodnik et al. [12] have shown
(see their remark on BlockMult above Theorem 13) that if
words are packed with ℓ-bit integers, we can multiply them
coordinatewise in O((log ℓ)^{1+ε}) time using only standard
AC^0 operations. We do such packed multiplication on fields
when we use signature sort in Lemma 15 and in Lemma 16,
with field lengths of at most (log n)^2 and (log n)^3,
respectively. Also, the rest of the algorithm only considers
integers of size O((log n)^2). Thus the packed simulated
multiplication takes O((log log n)^{1+ε}) time. However, since our
input integers are of length W/(log log n)^{1+ε}, the packed
multiplications will only be over that many bits. Hence we
can pack Ω((log log n)^{1+ε}) original packed multiplications,
thus performing Ω((log log n)^{1+ε}) original packed
multiplications at a time, in O(1) time per original packed
multiplication. □
A Variants of results from other papers


In this appendix we justify some simple variants of re- 
sults stemming from other papers. 


A.1 Signature sorting 


To describe our slight variant of signature sorting for
Lemma 11, we first review the original signature sorting of
Andersson et al. [5]. The input is a set S of n integers of
length ℓ. Signature sorting reduces the problem of sorting S to
two sorting problems, each with n integers, but with shorter
integers. In the reduction, for some parameter r, we
interpret the integers in S as vectors of r equal sized fields. The
reduction goes in several steps:


1. For each integer, each field is hashed into 4 log n bits. 
The hashed fields of an integer are packed together giv- 
ing us a reduced integer of length 4r log n bits. All the 
reduced integers are produced in linear total time. 


2. The reduced integers are now sorted (our first new sort- 
ing problem). 


3. In linear time, we identify n fields from the integers in S.


4. The n fields of ℓ/r bits are now sorted (our second new
sorting problem).


5. Based on the sorting done above, we sort S in linear 
time. 


The above high-level reduction is carefully implemented in
[5], to which the reader is referred for details. There is a
probability of at most 1/n^2 that something goes wrong in
the hashing. In [5] they just check the final sorting at the
end. However, here we will apply signature sorting to many
small subproblems, a small fraction of which are likely to
fail. Instead of aiming at no errors at all, we introduce the
following convenient step between step 2 and step 3.


2½. In expected linear time, redo and resort the reduced
integers if they are not OK.


To implement step 2½, we need to check that the reduced
integers are OK. Referring the reader to [5] for details, this
is easily done in connection with their linear time
construction of a certain compressed unordered trie Tp, needed for
steps 3-4. If there is a failure, we return to step 1. However,
when we iterate, we can just use bubble-sort to sort the
reduced integers in O(n^2) time. The point is that the probability
of iterating is 1/n^2, so the expected cost of all the iterations
is bounded by O(n^2) Σ_{i=1}^∞ n^{-2i} = O(1). The most
expensive part of step 2½ is therefore the first check, which is
always executed in linear time.
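For completeness, the geometric sum behind the expected cost of the iterations can be spelled out. This is a worked step under the assumptions that n ≥ 2 and that reaching the i-th repetition has probability at most n^{-2i}, with each repetition costing O(n^2) for the quadratic sort:

```latex
\[
\sum_{i=1}^{\infty} n^{-2i} \cdot O(n^2)
  = O(n^2) \cdot \frac{n^{-2}}{1 - n^{-2}}
  = O\!\left(\frac{1}{1 - n^{-2}}\right)
  = O(1) \qquad (n \ge 2).
\]
```

Since 1 − n^{-2} ≥ 3/4 for n ≥ 2, the constant hidden in the final O(1) does not depend on n.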
Summing up, with an expected linear-time additive
overhead, signature sorting with parameter r reduces the
problem of sorting n integers of ℓ bits to


(i) the problem in step 2 of sorting n reduced integers of 
4r log n bits and 


(ii) the problem in step 4 of sorting n fields of ℓ/r bits.


Here (i) has to be solved before (ii). This establishes our 
variant of signature sorting from Lemma 11. 


A.2 Deterministic signature sorting 


We want to show the statement of Lemma 14: 


With an O(n + (ℓ/r)s^2) time additive overhead,
deterministic signature sorting with parameter r
reduces the problem of splitting n ℓ-bit integers over
s splitters to


(i) the problem of splitting n reduced integers 
of 4r log n bits over s reduced splitters, and 


(ii) the problem of splitting n fields of ℓ/r bits
over s field splitters.


Here (i) has to be solved before (ii). 


Lemma 14 is essentially shown by Han in the beginning
of Section 8 in [20]. A small difference is in the additive
O((ℓ/r)s^2) term, where Han has O(ℓs^2). This term is the time
it takes to compute a perfect hash function. We just realize
that all we need is a hash function on fields such that for
any pair of splitters, the hash function gives different
values on their first distinguishing field. Since fields have
length ℓ/r, the hash function is computed in O((ℓ/r)s^2) time
using simple derandomization as described by Raman [31].
Moreover, Han's version could lead to many more than s
splitters in (ii). We need an analog of a simple trick from the
original signature sort [5, p. 79]; namely, at each branching
point in the trie Tp, to isolate integer fields smaller than
the smallest splitter field in linear time. This completes the
proof of Lemma 14.
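The weaker requirement on the hash function (distinct values only on the first distinguishing fields of each pair of splitters) can be illustrated with a toy search. The Python sketch below deliberately swaps in an exhaustive search over odd multipliers for a multiplicative hash; it is not Raman's derandomization [31] and does not achieve the stated time bound. The names `multiplicative_hash` and `find_separating_multiplier` and the parameters w (field length) and m (hash length) are ours.

```python
def multiplicative_hash(a, x, w, m):
    """Top m bits of (a * x) mod 2**w."""
    return ((a * x) & ((1 << w) - 1)) >> (w - m)


def find_separating_multiplier(pairs, w, m):
    """Return an odd multiplier that hashes every given pair to distinct values.

    pairs holds (x, y) with x != y: the first distinguishing fields of two
    splitters. Exhaustive search is a stand-in for the derandomization of
    [31]; None is returned if no odd multiplier below 2**w works.
    """
    for a in range(1, 1 << w, 2):
        if all(multiplicative_hash(a, x, w, m) != multiplicative_hash(a, y, w, m)
               for x, y in pairs):
            return a
    return None
```

Only the listed pairs are required to be separated, so other fields may collide freely. This is the relaxation that lets the cost depend on the field length rather than on the full integer length.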


A.3 Packed bucketing 


We want to show the statement of Lemma 19: 


Consider n integers packed with k (log k) integers
in each word, and suppose that an ℓ-bit label for each
integer is packed in a parallel set of words. We can
then sort the integers according to their labels in
O(ℓ/log n + 1/k) time per integer.
The lemma is essentially just a reformulation of Han's
Lemma 5 in [21], and is proved using the same proof. The
O(ℓ/log n) term is the inherent cost of packed bucketing
using that the labels are small. The O(1/k) term is the cost
of a matrix transposition of Thorup [36, Lemma 9] with
k log k integers in each word. Also, Han has replaced log k
by log log n using k = O(log n). However, this change is
not necessary for his proof. Thus Lemma 19 follows.
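To make the role of the small labels concrete, here is a plain, unpacked counting sort by label in Python. It is only an illustration with a hypothetical name (`sort_by_labels`): it spends constant time per integer plus time proportional to the 2^ℓ possible labels, and does not reproduce the packed bucketing or the matrix transposition that give the bound of the lemma.

```python
def sort_by_labels(integers, labels, ell):
    """Stable counting sort of integers by their ell-bit labels."""
    num_labels = 1 << ell
    counts = [0] * num_labels
    for lab in labels:
        counts[lab] += 1
    # Prefix sums give the first output position for each label.
    start, total = [0] * num_labels, 0
    for lab, cnt in enumerate(counts):
        start[lab] = total
        total += cnt
    out = [None] * len(integers)
    for x, lab in zip(integers, labels):
        out[start[lab]] = x
        start[lab] += 1
    return out
```

The sort is stable, so integers with equal labels keep their input order, which is what a grouping by label needs.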